Bankruptcy dataset visualisation

The goal of this notebook is to explore the companies' bankruptcy data from several angles: analyse the general behaviour of the data and the correlations between variables, and finally determine which variables are uninformative and should be dropped.

Libraries and dataset import

Dataset study

Distribution of companies in the data

Number of bankrupt companies in dataset

How many missing values?

In these datasets, missing values are denoted by a "?".
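A minimal sketch of how that convention can be handled at load time: passing `na_values="?"` to `pandas.read_csv` converts the markers to `NaN` and lets the columns come out numeric. The inline CSV excerpt and the column names are illustrative, not the real file.

```python
import io

import pandas as pd

# Hypothetical excerpt of one yearly file: "?" marks a missing value
raw = io.StringIO("X1,X2,bankrupt\n0.2,?,0\n?,1.1,1\n0.4,0.9,0\n")

# Treating "?" as NaN at parse time keeps the feature columns numeric
df = pd.read_csv(raw, na_values="?")

print(df.isna().sum().sum())  # total number of missing cells
```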

For the nth year

Per variable

Per company
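Both views come from the same `isna()` mask, summed along different axes. The toy frame below stands in for one yearly dataset; the values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for one yearly dataset (values are illustrative)
df = pd.DataFrame({
    "X1": [np.nan, np.nan, 0.4, 0.5],
    "X2": [1.0, 1.1, 1.2, 1.3],
    "X3": [0.1, np.nan, 0.3, 0.4],
})

missing_per_variable = df.isna().sum()       # one count per column (variable)
missing_per_company = df.isna().sum(axis=1)  # one count per row (company)
```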

Let's remove variables and companies with many missing values
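One way to sketch this cleaning step: drop the variables whose missing fraction exceeds a threshold, then drop the companies still missing too many of the remaining variables. The 50% threshold and the toy frame are assumptions for illustration, not the notebook's actual cut-offs.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for one yearly dataset
df = pd.DataFrame({
    "X1": [0.2, np.nan, 0.4, 0.5],
    "X2": [np.nan, np.nan, np.nan, 1.0],  # mostly missing
    "X3": [0.1, 0.2, 0.3, 0.4],
})

max_missing = 0.5  # illustrative threshold

# Drop variables (columns) missing in more than 50% of companies
cleaned = df.loc[:, df.isna().mean() <= max_missing]
# Then drop companies (rows) still missing more than 50% of the kept variables
cleaned = cleaned.loc[cleaned.isna().mean(axis=1) <= max_missing]
```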

Cleaned dataset

Dataset of the nth year with far fewer missing values

Replace the remaining missing values with the mean
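Mean imputation can be sketched with `fillna(df.mean())`, which replaces each remaining `NaN` by the mean of its own column; the toy values below are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"X1": [1.0, np.nan, 3.0], "X2": [4.0, 5.0, np.nan]})

# Each NaN is replaced by the mean of its column
df_filled = df.fillna(df.mean())
```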

Variable behavior

To interact with the graphs, run this notebook in Jupyter.

Predictor analysis

Boxplots of the first 20 variables, grouped by the bankruptcy variable
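A sketch of one way to build such a grid: one subplot per variable, each comparing the two bankruptcy classes side by side. The synthetic data, the 4x5 grid layout, and the class labels are assumptions for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in: 20 numeric variables plus a binary bankruptcy flag
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 20)),
                  columns=[f"X{i + 1}" for i in range(20)])
df["bankrupt"] = rng.integers(0, 2, size=100)

fig, axes = plt.subplots(4, 5, figsize=(20, 12))
for ax, col in zip(axes.ravel(), df.columns[:20]):
    # One box per class for this variable
    groups = [df.loc[df["bankrupt"] == k, col].dropna() for k in (0, 1)]
    ax.boxplot(groups)
    ax.set_xticks([1, 2])
    ax.set_xticklabels(["healthy", "bankrupt"])
    ax.set_title(col)
fig.tight_layout()
```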

Correlation analysis

From this correlation plot and the analysis of the target variable, we can see that several variables are highly correlated, or vary too little to be informative, so some of them could be dropped without loss of information in order to improve the performance of our future ML models. We will now identify the highly correlated variables.
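One common way to list the highly correlated pairs is to take the absolute correlation matrix, keep only its upper triangle (to drop the diagonal and the duplicate symmetric entries), and filter on a threshold. The synthetic columns and the 0.9 cut-off below are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: X7 is almost a copy of X1, X3 is independent
rng = np.random.default_rng(2)
base = rng.normal(size=200)
df = pd.DataFrame({
    "X1": base,
    "X7": base + 0.05 * rng.normal(size=200),
    "X3": rng.normal(size=200),
})

corr = df.corr().abs()
# Keep only the upper triangle: no diagonal, each pair counted once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()          # MultiIndex Series of (var_a, var_b) -> |corr|
high = pairs[pairs > 0.9]      # the highly correlated pairs
```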

Remember the corresponding descriptions?

So for instance, X1 and X7, which are respectively $\frac{net\:profit}{total\:assets}$ and $\frac{EBIT}{total\:assets}$, are highly correlated, which makes sense. Let's plot some of the correlated variables:

And now, let's run a PCA to see which variables carry the most information.

PCA: variables to focus on

We start by standardizing the data matrix and then apply a PCA.
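This step can be sketched with `StandardScaler` followed by `PCA`, then reading off the cumulative explained variance to count how many components are needed. The synthetic matrix and the 95% variance target are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 10 variables, one of them fully redundant
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
X[:, 1] = 2.0 * X[:, 0]  # exact linear copy of another column

# Standardize (zero mean, unit variance), then fit a full PCA
X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)

# How many components are needed to explain 95% of the variance?
cum = np.cumsum(pca.explained_variance_ratio_)
n_needed = int(np.searchsorted(cum, 0.95) + 1)
```

Because one column is an exact copy, the last component carries no variance and fewer components than variables suffice.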

Let's see how many variables will really be needed for the future model: the fewer variables there are, the easier and faster our model will be to train.

It is shown here that only 23 variables would really be needed to represent the total information in this dataset; this is due to the large amount of redundancy among the variables.

Here, we won't show the graph of individuals because of the size of the dataset; we will only show the graph of variables.

This type of graph is quite difficult to interpret and can be confusing. However, we can see immediately that almost all variables are poorly represented, since their arrows do not reach the boundary of the unit circle.
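For reference, a minimal sketch of such a correlation circle: for standardized data, each variable's coordinates on the first two components are the loadings scaled by the square roots of the explained variances, i.e. (approximately) its correlations with the components, so a well-represented variable has an arrow near the unit circle. The synthetic data and the figure styling are assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
cols = [f"X{i + 1}" for i in range(5)]

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_std)

# Variable coordinates: loading * sqrt(eigenvalue) per component,
# which for standardized data approximates corr(variable, component)
coords = pca.components_.T * np.sqrt(pca.explained_variance_)

fig, ax = plt.subplots(figsize=(6, 6))
ax.add_patch(plt.Circle((0, 0), 1.0, fill=False))  # the unit circle
for (x, y), name in zip(coords, cols):
    ax.arrow(0, 0, x, y, head_width=0.02, length_includes_head=True)
    ax.annotate(name, (x, y))
ax.set_xlim(-1.1, 1.1)
ax.set_ylim(-1.1, 1.1)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
```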